17 research outputs found

    Automatsko raspoznavanje hrvatskoga govora velikoga vokabulara

    This paper presents procedures used to develop a Croatian large vocabulary automatic speech recognition (LVASR) system. The proposed acoustic model is based on context-dependent triphone hidden Markov models and Croatian phonetic rules. Different acoustic and language models, developed using a large collection of Croatian speech, are discussed and compared. The paper proposes the feature vectors and acoustic modeling procedures that achieve the lowest word error rates for Croatian speech. In addition, Croatian language modeling procedures are evaluated and adopted for speaker-independent spontaneous speech recognition. The experiments and results show that context-dependent acoustic modeling based on Croatian phonetic rules and a parameter tying procedure can be used for efficient Croatian large vocabulary speech recognition with word error rates below 5%.
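The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is commonly computed (word-level edit distance divided by the reference length), the function below may help; the name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the paper's evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.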

    Automatic Intonation Event Detection Using Tilt Model for Croatian Speech Synthesis

    Text-to-speech (TTS) systems convert text into speech. Synthesized speech without prosody sounds unnatural and monotonous, so prosodic elements have to be implemented for it to sound natural. Generating prosodic elements directly from text is a demanding task. Our final goals are to build a complete prosodic model for Croatian and to implement it in our TTS system. In this work, we present one of the steps in adding prosody to a TTS system: detection of intonation events using the Tilt intonation model. We propose a training procedure composed of several subtasks. First, we hand-labelled a set of utterances and marked four types of prosodic events within each of them. Then we trained HMMs and used them to mark prosodic events on a larger set of utterances. We estimated parameters for each intonation event and generated f0 contours from those parameters. Finally, we evaluated the obtained f0 contours.
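For orientation, the Tilt model describes each intonation event by an amplitude, a duration and a single tilt value that summarises the balance between the rise and the fall. The sketch below follows the commonly published Tilt formulas; the class and function names are hypothetical, and the paper's own parameter estimation procedure is not shown in the abstract.

```python
from dataclasses import dataclass


@dataclass
class TiltEvent:
    amplitude: float  # |A_rise| + |A_fall|, in Hz
    duration: float   # D_rise + D_fall, in seconds
    tilt: float       # -1.0 (pure fall) to +1.0 (pure rise)


def to_tilt(a_rise: float, a_fall: float, d_rise: float, d_fall: float) -> TiltEvent:
    """Convert rise/fall amplitudes and durations into Tilt parameters."""
    amplitude = abs(a_rise) + abs(a_fall)
    duration = d_rise + d_fall
    if amplitude == 0 or duration == 0:
        raise ValueError("an event needs non-zero amplitude and duration")
    tilt_amplitude = (abs(a_rise) - abs(a_fall)) / amplitude
    tilt_duration = (d_rise - d_fall) / duration
    # The single tilt value is the average of the amplitude and duration tilts.
    return TiltEvent(amplitude, duration, 0.5 * (tilt_amplitude + tilt_duration))
```

A symmetric rise-fall event yields a tilt of 0, while an event with only a rise yields +1.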

    TEXT-TO-SPEECH SYNTHESIS: A PROTOTYPE SYSTEM FOR CROATIAN LANGUAGE

    This paper presents the development of a Croatian text-to-speech system capable of synthesizing speech from arbitrary text. Input text, which must be in normalized form, is first transcribed into a phoneme string (grapheme-to-phoneme conversion) and then processed by a synthesizer that concatenates small acoustic units of speech, diphones, using the TD-PSOLA method. A Croatian diphone database was built for the system, and a procedure for automatic selection of diphones from a spoken corpus is proposed. The quality of the resulting speech was assessed in a survey in which listeners gave subjective quality ratings, which also served to check its intelligibility.
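The front end described above (normalized text to phonemes, then phonemes to diphones) can be illustrated with a small sketch. It assumes a simplified view of Croatian orthography in which every letter maps to one phoneme and `dž`, `lj`, `nj` are always digraphs; real words contain exceptions, and the function names and the `_` pause symbol are hypothetical rather than taken from the paper.

```python
DIGRAPHS = ("dž", "lj", "nj")  # treated as single phonemes (simplifying assumption)
PAUSE = "_"                    # hypothetical symbol for silence / word boundary


def graphemes_to_phonemes(text: str) -> list[str]:
    """Map normalized text to a phoneme list, one symbol per letter or digraph."""
    text = text.lower()
    phonemes: list[str] = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:
            phonemes.append(text[i:i + 2])
            i += 2
        elif text[i].isalpha():
            phonemes.append(text[i])
            i += 1
        else:
            # Spaces and punctuation collapse into a single pause marker.
            if phonemes and phonemes[-1] != PAUSE:
                phonemes.append(PAUSE)
            i += 1
    return phonemes


def to_diphones(phonemes: list[str]) -> list[str]:
    """List the diphones (adjacent phoneme pairs) needed to synthesize a phoneme string."""
    core = list(phonemes)
    while core and core[0] == PAUSE:
        core.pop(0)
    while core and core[-1] == PAUSE:
        core.pop()
    if not core:
        return []
    padded = [PAUSE] + core + [PAUSE]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]
```

Each diphone name returned by `to_diphones` would then be looked up in a diphone database and the retrieved waveforms concatenated, which is the step TD-PSOLA handles in the described system.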

    Application of Deep Learning Methods for Detection and Tracking of Players

    This chapter deals with the application of deep learning methods in sports scenes for the purpose of detecting and tracking the athletes and recognizing their activities. The scenes recorded during handball games and training activities will be used as an example. Handball is a team sport played with the ball with well-defined goals and rules, with a given number of players who can participate in the game as well as their roles. Athletes move quickly throughout the field during the game, change position and roles from defensive to offensive, use different techniques and actions, and very often are partially or completely occluded by another athlete. If artificial lighting and cluttered background are additionally taken into account, it is clear that these are very challenging tasks for object detectors and trackers. The chapter will present the results of various experiments that include player and ball detection using state-of-the-art deep convolutional neural networks such as YOLO v3 or Mask R-CNN, player tracking using Deep Sort, key player determination using activity measures, and action recognition using LSTM. In the conclusion, open issues and challenges in applying deep learning methods in such a dynamic sports environment will be discussed

    A Croatian Weather Domain Spoken Dialog System Prototype

    Speech and language technologies have been in use in IT for some time. Because of their great impact and fast growth, it is necessary to introduce these technologies for the Croatian language. In this paper we propose a solution for developing a domain-oriented spoken dialog system for Croatian. We chose the weather domain because it has a limited vocabulary, its data are easily accessible, and it is highly applicable. The Croatian weather dialog system provides information about the weather in different regions of Croatia. The modules of the spoken dialog system perform automatic word recognition, semantic analysis, dialog management, response generation and text-to-speech synthesis. This is a first attempt to develop such a system for Croatian, and some new approaches are presented.

    Croatian speech synthesis based on unit selection and stochastic models

    Speech is the most natural mode of communication for people. Speech technologies such as speech synthesis, automatic speech recognition and spoken dialogue management enable spoken communication with machines and devices such as smartphones and entertainment devices. In many situations, for example when driving, a spoken-language interface can be more appropriate and practical than a keyboard and screen. To make spoken-language interfaces as natural and convenient as possible, increasingly better performance is expected from speech technologies, so their continued development is becoming more important. This work focuses on the development of a speech synthesis system for the Croatian language. A hybrid architecture based on unit selection and statistical parametric synthesis is proposed for the system. Speech generated by a statistical parametric synthesizer sounds intelligible and usually has consistent quality, while speech generated by unit selection can sound more natural. However, in unit selection speech synthesis, even a small number of units that do not join well with other units in a chain can significantly degrade the perceived quality of synthesis. Therefore, a hybrid synthesis method is proposed in which stochastic models of fundamental frequency are used to discard candidate unit chains that contain units with a low probability according to the model. The thesis is composed of 8 chapters. The first chapter presents the motivation and goals of the work. Chapter 2 gives an overview of the state of the art and previous research. Principles of unit selection and statistical parametric synthesis are given, as well as formant synthesis, which can be considered a predecessor of statistical parametric synthesis.
Hybrid approaches that combine ideas from unit selection and statistical parametric synthesis are described next. The chapter concludes with an overview of work specific to speech synthesis in Croatian. Chapter 3 presents in more detail the unit selection and statistical parametric speech synthesis methods used in the developed system. Chapter 4 presents the speech corpus that was used in the development of the speech synthesis system. A procedure for selecting a phonetically rich subset from a larger set of text is described. The procedure is applied to a large collection of Croatian text, described in this chapter, and a subset is selected that is small enough to be practical for recording yet allows synthesis of an arbitrary utterance. Construction of a speech unit database from speech recordings and corresponding transcriptions for use in the system is described last. A procedure for objective evaluation of algorithms for fundamental frequency (F0) estimation is described in Chapter 5. F0 is an important parameter for modelling speech, and thus its accurate estimation from natural speech is important in speech synthesis system construction. In the proposed objective evaluation procedure, algorithms are tested on synthetic speech, and the F0 values estimated by the tested algorithm are compared with the known reference values used for synthesis. Six F0 estimation algorithms are tested and compared on male and female synthetic speech. The architecture of the developed speech synthesis system is presented in Chapter 6. The system is composed of two basic subsystems, the linguistic analysis subsystem and the speech synthesis subsystem. Three variants of the speech synthesis subsystem, differing in the method of speech synthesis, are developed. The first uses the unit selection method, where synthetic speech is generated by concatenating units of natural speech.
The second uses the statistical parametric method, where speech is generated using stochastic models of speech. The third is a hybrid approach, in which a unit selection method is proposed where potential candidate unit chains are scored according to statistical models of speech. In the conventional unit selection approach, a target pronunciation is set first, and then a chain of units from the database is selected that best fits the target according to a cost function. However, if no chain of units in the database fits the target pronunciation, a different pronunciation that can be realised with the available units may still sound natural. In the hybrid system, this is achieved using a two-step unit selection procedure: in the first step, candidate chains are selected primarily based on the cost of joining consecutive units, while in the second step the final selection is made using statistical models of fundamental frequency. Chapter 7 presents the results of the evaluation of the synthetic speech. A formal subjective evaluation of speech quality, intelligibility, naturalness and the occurrence of irregularities in speech was conducted. For texts within the domain of the training corpus, a variant of the system based on unit selection was rated best, while for out-of-domain texts the best score was achieved by the hybrid system. Subjective evaluation can be inconvenient to perform at various stages of system development, since it can take a long time, be expensive, and give results that vary between runs. Objective evaluation procedures that can be done quickly and with consistent results between runs are favoured in that case. In this work, a measure based on automatic speech recognition (ASR) was proposed for automatic objective evaluation of speech intelligibility and was applied to the problem of parameter optimization of the hybrid system.
A subjective evaluation confirmed a correspondence between the results of automatic recognition and human perception, and an improvement in synthetic speech quality after optimization. The thesis concludes with Chapter 8, where the contributions of the thesis and possible future work are presented.
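The two-step selection described in the abstract can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the `Unit` fields, the join cost (absolute F0 mismatch at each join), the single Gaussian standing in for the thesis's stochastic F0 models, and the choice to score a chain by its least likely unit.

```python
import math
from typing import NamedTuple


class Unit(NamedTuple):
    name: str
    f0_start: float  # F0 at the start of the unit, in Hz
    f0_end: float    # F0 at the end of the unit, in Hz


def join_cost(chain: list[Unit]) -> float:
    """Step 1 cost: total F0 mismatch at the joins between consecutive units."""
    return sum(abs(a.f0_end - b.f0_start) for a, b in zip(chain, chain[1:]))


def gaussian_log_likelihood(x: float, mean: float, std: float) -> float:
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def weakest_unit_score(chain: list[Unit], mean: float, std: float) -> float:
    """Step 2 score: log-likelihood of the least plausible unit in the chain."""
    return min(
        gaussian_log_likelihood((u.f0_start + u.f0_end) / 2, mean, std)
        for u in chain
    )


def select_chain(candidates: list[list[Unit]], k: int, mean: float, std: float) -> list[Unit]:
    # Step 1: shortlist the k chains that join most smoothly.
    shortlist = sorted(candidates, key=join_cost)[:k]
    # Step 2: among those, prefer the chain whose weakest unit is most plausible.
    return max(shortlist, key=lambda chain: weakest_unit_score(chain, mean, std))
```

With `k=1` this reduces to selection by join cost alone; a larger `k` lets the F0 model reject a smoothly joined chain that contains an implausible unit.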

    Active Player Detection in Handball Scenes Based on Activity Measures

    In team sports training scenes, it is common to have many players on the court, each with their own ball and performing different actions. Our goal is to detect all players on the handball court and determine the most active player, the one performing the given handball technique. This is a very challenging task that requires, in addition to an accurate object detector able to deal with complex cluttered scenes, further information to determine the active player. We propose an active player detection method that combines the YOLO object detector, activity measures, and tracking methods to detect and track active players over time. Different ways of computing player activity were considered, and three activity measures are proposed, based on optical flow, spatiotemporal interest points, and convolutional neural networks. For tracking, we consider the Hungarian assignment algorithm and the more complex Deep SORT tracker, which uses additional visual appearance features to assist the assignment process. We also propose an evaluation measure for assessing the performance of the active player detection method. The method was successfully tested on a custom handball video dataset acquired in the wild and on basketball video sequences. The results are discussed, and some typical cases and issues are shown.
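The assignment step in tracking can be illustrated as a matching problem between existing tracks and new detections. The sketch below is not the Hungarian algorithm itself: it finds the same minimum-cost matching by brute force over permutations, which is only practical for a handful of players. The cost (1 minus bounding-box IoU) and all names are assumptions for illustration.

```python
from itertools import permutations

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign(tracks: list[Box], detections: list[Box]) -> list[int]:
    """Return result[i] = index of the detection matched to track i.

    Minimizes the total cost sum(1 - IoU). Requires at least as many
    detections as tracks. Brute force; the Hungarian algorithm solves
    the same problem in polynomial time.
    """
    if len(detections) < len(tracks):
        raise ValueError("need at least as many detections as tracks")
    best = min(
        permutations(range(len(detections)), len(tracks)),
        key=lambda perm: sum(1.0 - iou(t, detections[j]) for t, j in zip(tracks, perm)),
    )
    return list(best)
```

Once tracks are matched across frames, a per-track activity measure can be accumulated over time and the track with the highest value taken as the active player.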